User guide

User guide

This guide provides an overview of a simple geostatistical workflow, which consists of of three basic steps:

  1. Manipulation of spatial data
  2. Definition of geostatistical problem
  3. Visualization of problem solution

These steps are only shown to guide the reader, which is free to follow his/her own workflow.

Manipulating data

The workflow starts with the creation of spatial data objects, which can be loaded from disk or derived from other Julia variables. For example, given a Julia array (or image), which is not attached to any particular coordinate system:

using Plots

Z = [10sin(i/10) + j for i in 1:100, j in 1:200]

heatmap(Z, c=:bluesreds)
50 100 150 200 20 40 60 80 100 0 25 50 75 100 125 150 175 200

we can georeference the image as RegularGridData:

using GeoStats

Ω = RegularGridData{Float64}(Dict(:Z=>Z))
100×200 RegularGridData{Float64,2}
  origin:  (0.0, 0.0)
  spacing: (1.0, 1.0)
  variables
    └─Z (Float64)

The origin and spacing of samples in the regular grid can be specified in the constructor:

RegularGridData(Dict(:Z=>Z), (1.,1.), (10.,10.))
100×200 RegularGridData{Float64,2}
  origin:  (1.0, 1.0)
  spacing: (10.0, 10.0)
  variables
    └─Z (Float64)

and different spatial data types have different constructor options (see Data for more options).

We plot the spatial data and note a few differences compared to the image plot shown above:

plot(Ω)
-100 0 100 200 50 100 150 200 Z 0 25 50 75 100 125 150 175 200

First, we note that the image was rotated to match the first index i of the array with the horizontal "x" axis, and the second index j of the array with the vertical "y" axis. Second, we note that the image was stretched to reflect the real 100x200 size of the regular grid data.

Each sample in the spatial data object has a coordinate, which is calculated on demand for a given list of locations (i.e. spatial indices):

coordinates(Ω, 1:3)
2×3 Array{Float64,2}:
 0.0  1.0  2.0
 0.0  0.0  0.0

In-place versions exist to avoid unnecessary memory allocations.

All coordinates are retrieved as a ndims x npoints matrix when we do not specify the spatial indices:

coordinates(Ω)
2×20000 Array{Float64,2}:
 0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0  …   95.0   96.0   97.0   98.0   99.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     199.0  199.0  199.0  199.0  199.0

Tabular access

Spatial data types implement the Tables.jl interface, which means that they can be accessed as if they were tables with samples in the rows and variables in the columns:

Ω[1:3,:Z]
3-element Array{Float64,1}:
 1.9983341664682817
 2.9866933079506124
 3.9552020666133956

In this case, the coordinates of the samples are lost. To reconstruct a spatial data object, we need to save the spatial indices that were used to index the table:

zvals = Ω[1:3,:Z]
coord = coordinates(Ω, 1:3)

PointSetData(Dict(:Z=>zvals), coord)
3 PointSetData{Float64,2}
  variables
    └─Z (Float64)

To recover the original Julia array behind a spatial data object, we can index the data with the variable name. In this case, the size of the array (i.e. 100x200) is preserved:

Ω[:Z]
100×200 Array{Float64,2}:
  1.99833    2.99833    3.99833   …  197.998  198.998  199.998  200.998
  2.98669    3.98669    4.98669      198.987  199.987  200.987  201.987
  3.9552     4.9552     5.9552       199.955  200.955  201.955  202.955
  4.89418    5.89418    6.89418      200.894  201.894  202.894  203.894
  5.79426    6.79426    7.79426      201.794  202.794  203.794  204.794
  6.64642    7.64642    8.64642   …  202.646  203.646  204.646  205.646
  7.44218    8.44218    9.44218      203.442  204.442  205.442  206.442
  8.17356    9.17356   10.1736       204.174  205.174  206.174  207.174
  8.83327    9.83327   10.8333       204.833  205.833  206.833  207.833
  9.41471   10.4147    11.4147       205.415  206.415  207.415  208.415
  ⋮                               ⋱                                    
  3.2289     4.2289     5.2289       199.229  200.229  201.229  202.229
  2.24454    3.24454    4.24454      198.245  199.245  200.245  201.245
  1.24775    2.24775    3.24775      197.248  198.248  199.248  200.248
  0.248489   1.24849    2.24849      196.248  197.248  198.248  199.248
 -0.743268   0.256732   1.25673   …  195.257  196.257  197.257  198.257
 -1.71761   -0.717606   0.282394     194.282  195.282  196.282  197.282
 -2.66479   -1.66479   -0.664791     193.335  194.335  195.335  196.335
 -3.57536   -2.57536   -1.57536      192.425  193.425  194.425  195.425
 -4.44021   -3.44021   -2.44021      191.56   192.56   193.56   194.56 

Spatial views

Spatial data types can be viewed at a subset of locations without unnecessary memory allocations. Spatial views do not preserve the spatial regularity of the data in general.

By plotting a view of the first 10 lines of our regular grid data, we obtain a general PointSetData as opposed to a RegularGridData:

Ωᵥ = view(Ω, 1:10*100)
plot(Ωᵥ)
0 25 50 75 100 -20 -10 0 10 20 30 Z - 5 0 5 10 15

We plot a random view of the grid to emphasize that views do not preserve spatial regularity:

inds = rand(1:npoints(Ω), 100)
plot(view(Ω, inds))
-100 0 100 200 0 50 100 150 200 Z 0 25 50 75 100 125 150 175 200

Data partitions

Spatial data objects can be partitioned with various efficient methods. To demonstrate the operation, we partition our spatial data view into balls of given radius:

Π = partition(Ωᵥ, BallPartitioner(5.))
plot(Π)
0 25 50 75 100 -20 -10 0 10 20 30 Z - 5 0 5 10 15

or, alternatively, into two halfspaces:

Π = partition(Ωᵥ, BisectFractionPartitioner((1.,1.), 0.5))
plot(Π)
0 25 50 75 100 -20 -10 0 10 20 30 Z - 5 0 5 10 15

Spatial partitions are (lazy) iterators of spatial views, which are useful in many contexts as it will be shown in the next section of the user guide. To access a subset of a partition, we use index notation:

plot(Π[1])
0 10 20 30 40 50 -10 0 10 20 Z - 5 0 5 10 15
plot(Π[2])
50 60 70 80 90 100 -10 0 10 20 Z - 5 0 5 10 15

Defining problems

Having defined the spatial data objects, we proceed and define the geostatistical problem to be solved. In this guide, we illustrate geostatistical learning. For other types of geostatistical problems, please check the Problems section of the documentation.

Let's assume that we have spatial data with some variable that we want to predict in a supervised learning setting. We load the data from a CSV file, and inspect the available columns:

using CSV

df = CSV.read("data/agriculture.csv")

first(df, 5)

Columns band1, ..., band4 represent four satellite bands for different locations (x,y) in this spatial region. The column crop has the crop type for each location that was labeled manually with the purpose of training a learning model.

Because the labels are categorical variables, we need to inform the framework the correct categorical type:

using CategoricalArrays

df.crop = categorical(df.crop)

first(df, 5)

We can now georeference the data as a GeoDataFrame and plot some of the spatial variables:

Ω = GeoDataFrame(df, [:x,:y])

gr(format=:png) # hide
plot(Ω, variables=[:band4,:crop], ms=0.5, mc=:viridis, size=(800,400))

Similar to a generic statistical learning workflow, we split the data into "train" and "test" sets. The main difference here is that our spatial split function accepts a separating plane specified by its normal direction (1,-1):

Ωs, Ωt = split(Ω, 0.2, (1.,-1.))

plot(domain(Ωs), ms=0.5)
plot!(domain(Ωt), ms=0.5, mc=:green)
-100 0 100 200 0 50 100 150 200

We can visualize the domain of the "train" (or source) set Ωs in black, and the domain of the "test" (or target) set Ωt in green. We reserved 20% of the samples to Ωs and 80% to Ωt.

Let's define the learning task and the geostatistical learning problem. We want to predict the crop type based on the four satellite bands. We will train the model in Ωs where labels are available, and apply it to Ωt, which is our target:

features = [:band1,:band2,:band3,:band4]
label    = :crop

task = ClassificationTask(features, label)

problem = LearningProblem(Ωs, Ωt, task)

GeoStats.jl is integrated with the MLJ.jl project, which means that we can solve geostatistical learning problems with any classical learning model:

using MLJ

@load DecisionTreeClassifier

model = DecisionTreeClassifier()

solver = PointwiseLearn(model)
PointwiseLearn
  └─model ⇨ DecisionTreeClassifier @ 1…60

In this example, we selected a PointwiseLearn strategy to solve the geostatistical learning problem. This strategy consists of applying the learning model pointwise for every point in the spatial data:

Ω̂t = solve(problem, solver)

Plotting solutions

First, we note that the solution to a geostatistical learning problem is a spatial data object. We can inspect it with the same methods already described:

Ω̂t[1:5,:crop]

This also means that we can plot the solution directly, side by side with the true label in this synthetic example:

p̂ = plot(Ω̂t, ms=0.5, mc=:viridis, title="crop (prediction)")
p = plot(Ωt, variables=[:crop], ms=0.5, mc=:viridis)

plot(p̂, p, size=(800,400))

Visually, the learning model has succeeded predicting the crop. We can also estimate the generalization error of the geostatistical solver with spatial validation methods such as block cross-validation and leave-ball-out, but these methods deserve a separate tutorial.

With this example we conclude the user guide. To get familiar with other features of GeoStats.jl, please check the Tutorials and the Reference guide.